Clustered Index
General Info What An index is clustered if the order of the data records is the same as or close to the order of the index data entries Details *A file can be clustered on at most one search key *An index organized file implies that there is a clustered index at play, but the opposite is not true Unclustered Index In an unclustered index, the order of the data records and the order of the index data entries do not correspond at all. For unclustered indexes, we typically manage two files. One is the index file, and the other holds the actual records. Operation Costs on Alt-2 Unclustered Tree Index For the calculations, we will use the following values: *B: The number of data pages *R: The number of records per page *D: (Average) time to read or write disk page *F: Fan out of the index organized file *Assume 67% occupancy for index pages *Assume 100% occupancy for data pages *Assume that index data entries are 1/3 the size of actual data entries *Assume that the index is on a heap file Scan In this case, we ignore the index because the cost would be too large. If, hypothetically we used the index to scan, we could end up doing an IO for each data entry in the index and also with the additional cost of scanning the index. Equality Search This is similar to the unclustered version except we must add an extra page read because the actual data is not in the index file. The cost ends up being (log_F.5B + 1)D . We note that the .5 term is there because the index data entries are smaller by 1/3 and therefore, there are, in total .5B total pages of index data entries. Range Search For range search, we must find the starting place, and do sequential scans of the index to retrieve all the correct values. However, for the alternative 2 unclustered tree index, we have an additional overhead of retrieving the actual data pages as well. Therefore, the worst case cost is + \textrm{num~matching~idx~pages} - 1 + \textrm{num~matching~records}*D . As you can see, this number gets very large very quickly. As a rule of thumb, if 10 percent of data records satisfy the selection condition, we are better off retrieving all records, sorting them, and then retaining the ones that satisfy the selection Insert For insert, we must find the correct place to insert, and then perform the actual insert. Using the alternative 2 unclustered tree index, our cost is + 3*D . The extra 3 term accounts for the following things *We must write the index page *We must read the data page *We must write the data page Delete Same as insert Operation Costs on Alt-2 Clustered Index For the calculations, we will use the following values: *B: The number of data pages *R: The number of records per page *D: (Average) time to read or write disk page *F: Fan out of the index organized file *Assume 67% occupancy for index pages *Assume 67% occupancy for data pages *Assume that index data entries are 1/3 the size of actual data entries *Assume that the index is on a heap file Scan Following the same idea as before, we use sequential scan on the data pages instead of using the index because it is cheaper. Using the new assumptions, we see that our cost is 1.5BD Equality Selection For equality selection, we end up with the same cost as in the unclustered example, namely a cost of ((log_F.5B) + 1)*D Range Selection Range selection for the clustered index is significantly better than unclustered because we do potentially need to perform an IO for each matching tuple in the range. Therefore, we see that our cost is + \textrm{num~matching~data~pages}*D Insert Insert costs the same as the unclustered version Delete Same as insert